GDist-RIA Crawler: A Greedy Distributed Crawler for Rich Internet Applications

Authors

Seyed M. Mirtaheri

Gregor von Bochmann

Guy-Vincent Jourdan

Iosif-Viorel Onut

Abstract

Crawling web applications is important for indexing, accessibility and security assessment. Crawling traditional web applications is an old problem, for which good and efficient solutions are known. Crawling Rich Internet Applications (RIAs) quickly and efficiently, however, is an open problem. Technologies such as AJAX and partial Document Object Model (DOM) updates make crawling a RIA even more time-consuming for the web crawler. One way to reduce the time needed to crawl a RIA is to crawl it in parallel with multiple computers. The previously published Dist-RIA Crawler presents a distributed breadth-first search algorithm to crawl RIAs. This paper extends the Dist-RIA Crawler in two ways. First, it introduces an adaptive load-balancing algorithm that enables the crawler to learn the speed of the nodes and adapt to changes, and thus make better use of the resources. Second, it presents a distributed greedy algorithm to crawl a RIA in parallel, called the GDist-RIA Crawler. The GDist-RIA Crawler uses a server-client architecture in which the server dispatches crawling jobs to the crawling clients. This paper presents a prototype implementation of the GDist-RIA Crawler, explains some of the techniques used to implement the prototype, and examines empirical performance measurements.
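
The abstract does not spell out the load-balancing rule, so the following is only a minimal sketch of one way a server could learn node speeds and dispatch jobs accordingly. It assumes an exponential moving average of reported job times and a "shortest estimated finish time" rule; the names AdaptiveDispatcher, NodeStats, report_done, pick_node and the smoothing parameter are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass
class NodeStats:
    """Server-side view of one crawling client (hypothetical structure)."""
    avg_job_seconds: float = 1.0  # running estimate of time per job (assumed prior)
    pending_jobs: int = 0         # jobs dispatched but not yet reported done


class AdaptiveDispatcher:
    """Illustrative server that adapts job dispatching to observed node speed."""

    def __init__(self, node_ids, smoothing=0.5):
        # smoothing is the weight given to the newest measurement (assumption)
        self.smoothing = smoothing
        self.nodes = {node_id: NodeStats() for node_id in node_ids}

    def report_done(self, node_id, elapsed_seconds):
        """A client reports a finished job; update its speed estimate."""
        stats = self.nodes[node_id]
        stats.pending_jobs = max(0, stats.pending_jobs - 1)
        stats.avg_job_seconds = (
            self.smoothing * elapsed_seconds
            + (1 - self.smoothing) * stats.avg_job_seconds
        )

    def pick_node(self):
        """Choose the node expected to finish one more job the soonest."""
        def estimated_finish(node_id):
            stats = self.nodes[node_id]
            return (stats.pending_jobs + 1) * stats.avg_job_seconds

        best = min(self.nodes, key=estimated_finish)
        self.nodes[best].pending_jobs += 1
        return best


if __name__ == "__main__":
    dispatcher = AdaptiveDispatcher(["fast", "slow"])
    dispatcher.pick_node()
    dispatcher.pick_node()
    dispatcher.report_done("fast", 0.2)
    dispatcher.report_done("slow", 4.0)
    print([dispatcher.pick_node() for _ in range(5)])
```

Under these assumptions, a node that reports short job times receives more of the subsequent jobs, while a slow node still gets work once the fast node's queue grows long enough. This shows the general idea of adapting to node speed, not the algorithm evaluated in the paper.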

Similar resources

Recombination Operators in Genetic Algorithm - Based Crawler: Study and Experimental Appraisal

A focused crawler traverses the web selecting out relevant pages according to a predefined topic. While browsing the internet it is difficult to identify relevant pages and predict which links lead to high quality pages. This paper proposes a topical crawler for Vietnamese web pages using greedy heuristic and genetic algorithms. Our crawler based on genetic algorithms uses different recombinati...

Building Data-Intensive Grid Applications with Globus Toolkit - An Evaluation Based on Web Crawling

Nowadays, there is a trend to create resource-consuming applications without building heavy computer centers, but to use resources on computer systems distributed over the internet. Grid middleware is a framework to access these resources. The concern of this paper is the evaluation of a specific grid middleware, namely Globus Toolkit, for data-intensive applications. As a test case, we have de...

Enabling automatic testing of Modern Web Applications using Testing Plug-ins

Modern web applications are very dynamic in nature with rich user experience. Such applications typically use Web 2.0 and Asynchronous JavaScript and XML (AJAX) technologies. These applications are very different from conventional web applications as they use stateful C/S communication in an asynchronous fashion. The user agent is able to communicate with web server without explicit form submiss...

Web Crawler: A Review

Information Retrieval deals with searching and retrieving information within the documents and it also searches the online databases and internet. Web crawler is defined as a program or software which traverses the Web and downloads web documents in a methodical, automated manner. Based on the type of knowledge, web crawler is usually divided in three types of crawling techniques: General Purpo...

Search optimization technique for Domain Specific Parallel Crawler

Architectural framework of World Wide Web is used for accessing linked documents spread out over millions of machines all over the Internet. Web is a system that makes exchange of data on the internet easy and efficient. Due to the exponential growth of web, it has become a challenge to traverse all URLs in the web documents and handle these documents, so it is necessary to optimize the paralle...

Publication date: 2014
